Take-home_Ex03

Author

Xuerong

Introduction

Housing prices are influenced by structural and locational factors. This report explores the use of machine learning and geographically weighted techniques to predict HDB resale prices in Singapore. The focus is on improving prediction accuracy by accounting for spatial heterogeneity.


Data Preparation

Loading Libraries

library(tidyverse)    
Warning: package 'ggplot2' was built under R version 4.2.3
Warning: package 'tidyr' was built under R version 4.2.3
Warning: package 'dplyr' was built under R version 4.2.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(sf)           
Warning: package 'sf' was built under R version 4.2.3
Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
library(randomForest) 
randomForest 4.7-1.1
Type rfNews() to see new features/changes/bug fixes.

Attaching package: 'randomForest'

The following object is masked from 'package:dplyr':

    combine

The following object is masked from 'package:ggplot2':

    margin
library(spatialRF)    

Attaching package: 'spatialRF'

The following object is masked from 'package:stats':

    rf
library(caret)        
Loading required package: lattice

Attaching package: 'caret'

The following object is masked from 'package:purrr':

    lift
library(tmap)        
Breaking News: tmap 3.x is retiring. Please test v4, e.g. with
remotes::install_github('r-tmap/tmap')
library(xgboost)

Attaching package: 'xgboost'

The following object is masked from 'package:dplyr':

    slice

Importing Cleaned Data

# Load the resale data (Resale.csv)
resale_data <- read_csv("/Users/sharon/OneDrive - Singapore Management University/ISSS626 Data/Take_home_Ex/Take_home_Ex03b/Resale.csv") %>%
  filter(month >= "2023-01" & month <= "2024-09")
Rows: 192970 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): month, town, flat_type, block, street_name, storey_range, flat_mode...
dbl (3): floor_area_sqm, lease_commence_date, resale_price

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Load the coordinates data (coords.rds)
coords <- read_rds("/Users/sharon/OneDrive - Singapore Management University/ISSS626 Data/Take_home_Ex/Take_home_Ex03b/coords.rds")

# Create tidy resale data with appropriate transformations
resale_tidy <- resale_data %>%
  mutate(address = paste(block, street_name)) %>%
  mutate(remaining_lease_yr = as.integer(str_sub(remaining_lease, 0, 2))) %>%
  mutate(remaining_lease_mth = as.integer(str_sub(remaining_lease, 9, 11)))

# Merge both data frames using a left join on the 'address' column
resale_combined <- resale_tidy %>%
  mutate(address = str_trim(address)) %>%  # Trim whitespace in address
  left_join(coords %>% mutate(address = str_trim(address)), by = "address")

# Convert latitude and longitude to numeric values directly in the mutate step
resale_combined <- resale_combined %>%
  mutate(
    latitude = as.numeric(latitude),
    longitude = as.numeric(longitude)
  ) %>%
  filter(!is.na(latitude) & !is.na(longitude))


# Filter out rows with missing latitude or longitude after merging
resale_combined <- resale_combined %>%
  filter(!is.na(latitude) & !is.na(longitude))

Inspecting Data Structure

# Inspect the data structure to verify column names
glimpse(resale_combined)
Rows: 14,649
Columns: 17
$ month               <chr> "2023-01", "2023-01", "2023-01", "2023-01", "2023-…
$ town                <chr> "ANG MO KIO", "ANG MO KIO", "ANG MO KIO", "ANG MO …
$ flat_type           <chr> "2 ROOM", "2 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", …
$ block               <chr> "314", "314", "225", "225", "457", "232", "570", "…
$ street_name         <chr> "ANG MO KIO AVE 3", "ANG MO KIO AVE 3", "ANG MO KI…
$ storey_range        <chr> "04 TO 06", "07 TO 09", "04 TO 06", "07 TO 09", "0…
$ floor_area_sqm      <dbl> 44, 44, 67, 67, 89, 67, 67, 67, 67, 73, 68, 67, 81…
$ flat_model          <chr> "Improved", "Improved", "New Generation", "New Gen…
$ lease_commence_date <dbl> 1978, 1978, 1978, 1978, 1980, 1977, 1979, 1979, 19…
$ remaining_lease     <chr> "54 years 01 month", "54 years 01 month", "54 year…
$ resale_price        <dbl> 280000, 282000, 380000, 380000, 425000, 367000, 40…
$ address             <chr> "314 ANG MO KIO AVE 3", "314 ANG MO KIO AVE 3", "2…
$ remaining_lease_yr  <int> 54, 54, 54, 54, 56, 53, 55, 55, 55, 54, 57, 57, 56…
$ remaining_lease_mth <int> 1, 1, 1, 1, 1, 9, 5, 4, 4, 2, NA, 10, 6, 4, 11, 3,…
$ postal              <chr> "560314", "560314", "560225", "560225", "560457", …
$ latitude            <dbl> 1.366227, 1.366227, 1.367396, 1.367396, 1.365698, …
$ longitude           <dbl> 103.8501, 103.8501, 103.8382, 103.8382, 103.8589, …

Filtering and Preprocessing Data

# Focus on four-room flats for this analysis
resale_filtered <- resale_combined %>%
  filter(flat_type == "4 ROOM") %>%
  drop_na()

# Extract numeric values from remaining_lease and storey_range
resale_filtered <- resale_filtered %>%
  mutate(
    remaining_lease_years = as.numeric(str_extract(remaining_lease, '\\d+')),
    floor_area_sqm = as.numeric(floor_area_sqm),
    lease_commence_date = as.numeric(lease_commence_date),
    age = 2024 - lease_commence_date,
    storey_lower = as.numeric(str_extract(storey_range, '\\d+')),
    latitude = as.numeric(latitude),
    longitude = as.numeric(longitude)
  ) %>%
  filter(!is.na(latitude) & !is.na(longitude))

Geospatial Analysis

Creating Spatial Object

# Convert to spatial object
resale_sf <- st_as_sf(resale_filtered, coords = c("longitude", "latitude"), crs = 4326)

# Transform to local coordinate system for distance calculations
resale_sf <- st_transform(resale_sf, crs = 3414)

Adding Distance-Based Features

Since we do not have actual nearby amenities data, we will use dummy features to demonstrate how spatial distances could be integrated into the analysis. If actual data on amenities (e.g., MRT, parks) were available, these features could be calculated accordingly.

# Example: Create random points to simulate amenities (e.g., MRT stations)
set.seed(123)
num_points <- 10
random_coords <- data.frame(
  longitude = runif(num_points, min = 103.6, max = 104.0),
  latitude = runif(num_points, min = 1.2, max = 1.5)
)

# Convert random points to spatial data
amenities_sf <- st_as_sf(random_coords, coords = c("longitude", "latitude"), crs = 4326) %>%
  st_transform(crs = 3414)

# Calculate distance from each property to the nearest "amenity"
resale_sf <- resale_sf %>%
  mutate(dist_to_amenity = st_distance(geometry, amenities_sf) %>% apply(1, min))

Exploratory Data Analysis

Summary Statistics

summary(resale_sf)
    month               town            flat_type            block          
 Length:6010        Length:6010        Length:6010        Length:6010       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
 street_name        storey_range       floor_area_sqm    flat_model       
 Length:6010        Length:6010        Min.   : 74.00   Length:6010       
 Class :character   Class :character   1st Qu.: 92.00   Class :character  
 Mode  :character   Mode  :character   Median : 93.00   Mode  :character  
                                       Mean   : 94.01                     
                                       3rd Qu.: 94.00                     
                                       Max.   :132.00                     
 lease_commence_date remaining_lease     resale_price       address         
 Min.   :1969        Length:6010        Min.   : 320000   Length:6010       
 1st Qu.:1997        Class :character   1st Qu.: 530000   Class :character  
 Median :2013        Mode  :character   Median : 595000   Mode  :character  
 Mean   :2007                           Mean   : 637014                     
 3rd Qu.:2018                           3rd Qu.: 690000                     
 Max.   :2020                           Max.   :1450000                     
 remaining_lease_yr remaining_lease_mth    postal         
 Min.   :44.00      Min.   : 1.000      Length:6010       
 1st Qu.:72.00      1st Qu.: 3.000      Class :character  
 Median :88.00      Median : 6.000      Mode  :character  
 Mean   :81.65      Mean   : 6.118                        
 3rd Qu.:93.00      3rd Qu.: 9.000                        
 Max.   :95.00      Max.   :11.000                        
 remaining_lease_years      age         storey_lower             geometry   
 Min.   :44.00         Min.   : 4.00   Min.   : 1.000   POINT        :6010  
 1st Qu.:72.00         1st Qu.: 6.00   1st Qu.: 4.000   epsg:3414    :   0  
 Median :88.00         Median :11.00   Median : 7.000   +proj=tmer...:   0  
 Mean   :81.65         Mean   :17.42   Mean   : 9.045                       
 3rd Qu.:93.00         3rd Qu.:27.00   3rd Qu.:13.000                       
 Max.   :95.00         Max.   :55.00   Max.   :49.000                       
 dist_to_amenity   
 Min.   :   94.72  
 1st Qu.: 3700.55  
 Median : 5558.15  
 Mean   : 5359.85  
 3rd Qu.: 7072.38  
 Max.   :11044.78  

Visualizing Data

Spatial Distribution of Properties

tmap_mode("view")
tmap mode set to interactive viewing
tm_shape(resale_sf) +
  tm_dots(size = 0.1, title = "Properties")

Modeling

Splitting Data

set.seed(123)
training_indices <- createDataPartition(resale_sf$resale_price, p = 0.8, list = FALSE)
train_data <- resale_sf[training_indices, ]
test_data <- resale_sf[-training_indices, ]

Ordinary Least Squares (OLS) Model

# Fit an OLS model using relevant features
ols_model <- lm(resale_price ~ floor_area_sqm + age + storey_lower + dist_to_amenity, data = train_data)
summary(ols_model)

Call:
lm(formula = resale_price ~ floor_area_sqm + age + storey_lower + 
    dist_to_amenity, data = train_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-336748  -70848  -20342   50171  532316 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      5.879e+05  2.787e+04  21.090  < 2e-16 ***
floor_area_sqm   1.358e+03  2.938e+02   4.623 3.87e-06 ***
age             -4.220e+03  1.264e+02 -33.385  < 2e-16 ***
storey_lower     1.024e+04  2.365e+02  43.279  < 2e-16 ***
dist_to_amenity -1.811e+01  6.943e-01 -26.078  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 111100 on 4804 degrees of freedom
Multiple R-squared:  0.4879,    Adjusted R-squared:  0.4875 
F-statistic:  1144 on 4 and 4804 DF,  p-value: < 2.2e-16

Random Forest Model

# Prepare data for RF
rf_vars <- train_data %>%
  st_drop_geometry() %>%
  select(resale_price, floor_area_sqm, age, storey_lower, dist_to_amenity)

# Train random forest model
rf_model <- randomForest(resale_price ~ ., data = rf_vars, importance = TRUE)
importance(rf_model)
                 %IncMSE IncNodePurity
floor_area_sqm  45.81251  1.137932e+13
age             67.65798  2.950279e+13
storey_lower    74.90955  2.688229e+13
dist_to_amenity 76.54817  2.456592e+13

XGBoost Model

# Prepare data for XGBoost
xgb_train_data <- train_data %>%
  st_drop_geometry() %>%
  select(floor_area_sqm, age, storey_lower, dist_to_amenity, resale_price) %>%
  mutate_if(is.factor, as.numeric) # Convert categorical variables to numeric if present

train_matrix <- as.matrix(xgb_train_data %>% select(-resale_price))
train_labels <- xgb_train_data$resale_price

# Train an XGBoost model
xgb_model <- xgboost(
  data = train_matrix,
  label = train_labels,
  nrounds = 100,
  objective = "reg:squarederror",
  verbose = 0
)

# Make predictions
xgb_test_data <- test_data %>%
  st_drop_geometry() %>%
  select(floor_area_sqm, age, storey_lower, dist_to_amenity) %>%
  mutate_if(is.factor, as.numeric)

test_matrix <- as.matrix(xgb_test_data)
xgb_predictions <- predict(xgb_model, test_matrix)

Model Evaluation

Predictions and Metrics

# OLS predictions
ols_predictions <- predict(ols_model, newdata = test_data)

# Random forest predictions
rf_predictions <- predict(rf_model, newdata = st_drop_geometry(test_data))

# XGBoost predictions
xgb_test_data <- test_data %>%
  st_drop_geometry() %>%
  select(floor_area_sqm, age, storey_lower, dist_to_amenity) %>%
  mutate_if(is.factor, as.numeric)

test_matrix <- as.matrix(xgb_test_data)
xgb_predictions <- predict(xgb_model, test_matrix)

# Define evaluation function
evaluate_model <- function(actual, predicted) {
  data.frame(
    RMSE = sqrt(mean((predicted - actual)^2)),
    R2 = cor(predicted, actual)^2
  )
}

# Evaluate models
ols_metrics <- evaluate_model(test_data$resale_price, ols_predictions)
rf_metrics <- evaluate_model(test_data$resale_price, rf_predictions)
xgb_metrics <- evaluate_model(test_data$resale_price, xgb_predictions)

# Combine results
model_comparison <- rbind(
  OLS = ols_metrics,
  Random_Forest = rf_metrics,
  XGBoost = xgb_metrics
)
model_comparison
                   RMSE        R2
OLS           107749.37 0.4917547
Random_Forest  69747.05 0.7938445
XGBoost        53015.46 0.8771559

Results and Discussion

Model Comparison

This analysis demonstrates the utility of machine learning methods such as XGBoost, random forest, and OLS regression for predicting HDB resale prices. The results show that:

  • XGBoost is the most effective model, achieving the lowest RMSE (53,015.46) and the highest R² (0.8772). This indicates that XGBoost captures complex, non-linear relationships in the data very effectively.
  • Random Forest also performs well, with a lower RMSE (70,175.24) compared to OLS and an R² of 0.7909, making it a good option for moderately complex relationships.
  • OLS Regression is the simplest model and performs the worst, indicating it is less suited to capturing the nuanced factors affecting HDB resale prices.

Future studies could further enhance this analysis by integrating real locational factors such as proximity to amenities, MRT stations, and other neighborhood characteristics to improve prediction accuracy even further. Future studies could further enhance this analysis by integrating real locational factors such as proximity to amenities.